Abstract

We consider zero-delay single-user and multi-user source coding with an
average distortion constraint and decoder side information. The zero-delay
constraint translates into causal (sequential) encoder and decoder
pairs as well as the use of instantaneous codes. For the single-user setting,
we show that optimal performance is attained by time sharing at most two
scalar encoder-decoder pairs that use zero-error side information codes.
Side information lookahead is shown to be useless in this setting. We show
that the restriction to causal encoding functions is the one that causes
the performance degradation, compared to unrestricted systems, and not
the sequential decoders or instantaneous codes. Furthermore, we show
that even without delay constraints, if either the encoder or the decoder is
restricted a priori to be scalar, the performance loss cannot be compensated
by the other component, which can be scalar as well without further
loss. Finally, we show that the multi-terminal source coding problem can
be solved in the zero-delay regime and the rate-distortion region is given.

1 Introduction 

The classical source coding theorems and their converse counterparts provide 
fundamental limits which are usually asymptotic in the sense that they can 
be achieved by systems that introduce delay (imposed by operating on blocks)
and/or require complexity that grows exponentially. In many practical scenar- 
ios, however, delay is not tolerable. Such systems include many real-time source 
coding applications such as streaming live multimedia. In such applications, 
not only should the data be encoded and decoded in real-time, but also there is 
no way to improve the reconstruction of previously decoded data as new data 
arrive. While zero-delay is clearly motivated by fast real-time systems, it can 
also be motivated by extremely slow systems that observe data once in a long 
while and must act on this data before new data will arrive. This work focuses 
on the fundamental limits of such zero-delay systems with the addition of side 
information (SI) which is available to the decoder. 

Zero-delay operation means that each time a new source symbol is observed,
a message must be sent to the decoder. The decoder must decode the message
and reconstruct the source symbol before the next message arrives. This, in
turn, translates into three constraints: First, the encoder functions must
be causal functions of the source symbols. Secondly, the code with which the
encoder sends the messages to the decoder must be instantaneous, meaning
the decoder can detect the end of each codeword before the whole codeword of
the next symbol arrives. Alternatively, the decoder must be able to parse
the bit-stream which is composed of the received codewords in a causal manner.
Finally, the decoder must be a causal function of the encoder messages and its
SI.

In this work, we consider both single- and multi-user scenarios. We start 
by describing the single-user setting and related work and then continue to the 
multiterminal scenarios. In the single-user setting, we consider the following 
source coding problem: Symbols produced by a discrete memoryless source 
(DMS) are to be encoded, transmitted noiselessly and reproduced by a decoder 
without delay. The decoder has access to SI correlated to the source. The 
average distortion between the source and the reproduced symbols is constrained 
to be smaller than some predefined constant. 

When no distortion is allowed, this problem falls within the scope of zero-error
source coding with SI, which was initially introduced by Witsenhausen in
[1]. Witsenhausen considered fixed-length coding and characterized the
side-information structure as a confusability graph defined on the source alphabet.
With this characterization, fixed-length SI codes were equivalent to colorings
of the associated graph. Alon and Orlitsky [2] considered variable-rate codes
for the zero-error problem. Two classes of codes were considered and lower
and upper bounds were derived for both the scalar and infinite block length
regimes. The work of Alon and Orlitsky was further extended by Koulgi et
al. [3], who showed that the asymptotic zero-error rate of transmission is the
complementary graph entropy of an associated graph. It was also shown in [3]
that the design of an optimal code is NP-hard and a sub-optimal, polynomial-time
algorithm was proposed. The combination of zero-error codes and maximum
per-letter distortion was considered in [4]. As we will see in the sequel,
zero-error source coding with SI is relevant to our setting as well, since the encoder
can rely on the SI when sending its messages.

When the source alphabet is finite and distortion is allowed, scalar quantizer 
design boils down to finding the best partition of the source alphabet into dis- 
joint subsets. The number of such subsets will be governed by the constraints 
imposed on the system (distortion, rate, encoder's output entropy etc.). In 
[5], Muresan and Effros proposed an algorithm for finding good partitions in
various settings which include the variable-rate scalar Wyner-Ziv [6] setting.
However, the optimality of the partitions relied on the convexity of the subsets. 
Namely, the subsets in each partition must be intervals in the source alpha- 
bet. It was noted by the authors that this requirement is too strong in the 
scalar Wyner-Ziv setting and there are many cases where the optimal partition 
contains non-convex subsets. We demonstrate such a scenario in the examples 
section of this work. Bounds on the performance of scalar, fixed-rate source 
codes with decoder SI were recently given in [7]. 

Real-time codes form a subclass of the class of causal codes, as defined by
Neuhoff and Gilbert [8]. In [8], entropy coding is used on the whole sequence of
reproduction symbols, introducing arbitrarily long delays. In the real-time case,
entropy coding has to be instantaneous, symbol-by-symbol (possibly taking
into account past transmitted symbols). It was shown in [8] that for a DMS,
the optimal causal encoder consists of time-sharing between no more than two
scalar encoders. In [9], Weissman and Merhav extended [8] to include SI at the
decoder, encoder or both. The discussion in [9] was restricted, however, only
to settings where the encoder and decoder could agree on the reconstruction
symbol (i.e., the SI was used for compression, but not in the reproduction at
the decoder). Non-causal coding of a source when the decoder has causal access
to SI (with possibly a finite look-ahead) was considered by Weissman and
El-Gamal [10]. Gaarder and Slepian [11] gave structure theorems for fixed-rate
encoders and decoders that are time-invariant finite-state devices.

The results of [8] for causal coding can be adapted to zero-delay coding
by replacing the arbitrarily long delay entropy coding with zero-delay Huffman
coding, thus showing that time-sharing at most two scalar quantizers, followed
by Huffman coding, is optimal. When the SI is available to both the encoder
and decoder, the results of [9] can be adapted to zero-delay in a similar manner,
where at most two scalar quantizers followed by Huffman coding are used for
every possible SI symbol. The setting where the decoder can use the SI both
to decode the compressed message and to reproduce the source was left open in
[9]. As will be seen in the sequel, our results on zero-delay can be easily adapted
back to the causal setting and answer some of the questions left open by [9].

This paper has several contributions. In the single-user setting, the first
contribution is the extension of [8] to zero-delay with decoder SI and an average
distortion constraint, where unlike [9], we do not restrict the usage of the SI.
We show that results in the spirit of [8] continue to hold here in the sense that
it is optimal to time-share at most two scalar encoders and decoders. However,
unlike the encoders in [8] that use Huffman codes in the zero-delay setting, here,
the encoders transmit their messages using zero-error SI instantaneous codes,
as defined in [2] (and properly defined in the sequel). Secondly, we show
that there is no performance gain if the decoder has non-causal access to the SI
(lookahead) and in fact, only the current SI symbol is useful. This is in contrast
to the arbitrary delay and causal SI setting of [10], where SI lookahead was shown
to improve the performance. These results place the optimal performance of
zero-delay systems far below the classical source coding results where arbitrary 
delay is allowed. We further ask which of the zero-delay constraints (causal 
encoder, instantaneous code and causal decoder) are causing this degradation 
in performance. It is shown that if we remove the constraint on the encoder and 
allow it to observe the whole sequence in advance but force instantaneous codes 
(restricting the number of bits in each transmission) and a causal decoder, at 
least in some cases, the classical rate-distortion performance can be obtained. 
This suggests that the "blame" for the relatively poor performance of the zero- 
delay systems falls, at least in these cases, on the restriction to causal encoding 
functions. The scheme we use to show the last point is surprisingly simple, but 
to the best of our knowledge it is novel nonetheless. Finally, we show that in
the zero-delay setting, if we a priori restrict attention to scalar decoders (that
use only the current encoder message and SI symbol), scalar encoders will do as
well as encoders that observe the whole source sequence in advance. Similarly, if 
we restrict the encoders to be scalar, scalar decoders will do as well as decoders 
that introduce delay and generally use all the encoder messages to reproduce 
the symbols. This means that the simplicity of one of the components (encoder 
/ decoder) cannot be compensated by the complexity of the other component. 

Moving on to multiterminal scenarios, the zero-delay constraint should be
carefully defined. Specifically, suppose we have several non-cooperating encoders 
and one decoder. How does the decoder receive the bit-streams from each of the 
encoders and can it decode the message from one of the users before it starts 
decoding the other message? If the decoder must decode the messages simulta- 
neously, then each encoder should use an independent instantaneous code which 
can depend only on the past. However, if for example, one of the messages can 
be decoded first (say, the messages are interleaved in one bit-stream), the first
decoded message, as well as all past messages, can serve as SI when decoding
the other message. This idea can be generalized to breaking the encoders' messages
into small pieces and sending them according to a predefined protocol.
We revisit the well known "multiterminal source coding problem" (two user 
Wyner-Ziv problem) , where a source emits correlated pairs and each element of 
the pairs is observed by a different encoder. Each encoder sends a codeword to 
a joint decoder which reconstructs the current pair. The distortion between the 
reconstructions and the original variables should not exceed a given threshold. 
The challenge here is to find the set of achievable rates and distortions. This
problem remained an open problem for over three decades (relevant literature
review for this problem will be given in Section 4). We consider a zero-delay
version of this problem, where both users and the decoder must operate with
zero-delay, assuming that one of the users' messages is decoded first and serves
as SI to the other user (simultaneous decoding of both messages is a simpler
problem and its solution follows immediately from the solution to our problem).
We show that, unlike the arbitrary delay setting, the zero-delay problem can be 
readily solved and the rate distortion region can be characterized. 

The remainder of this paper is organized as follows. In Section 2, we give
the formal setting and notation used throughout the paper. Section 3 deals
with single-user problems. Multiterminal zero-delay source coding is handled in
Section 4. We conclude this work in Section 5.

2 Preliminaries 

We begin with notation conventions. Capital letters represent scalar random
variables (RVs), specific realizations of them are denoted by the corresponding
lower case letters, and their alphabets by calligraphic letters. For positive
integers $i, j$, $x_j^i$ will denote the vector $(x_j, \ldots, x_i)$. If $j = 1$, the subscript will be
omitted. The probability distribution over a finite alphabet $\mathcal{X}$ will be denoted
by $P_X(\cdot)$. When there is no room for ambiguity, we will use $P(x)$ instead of
$P_X(x)$.

For a joint distribution $P(x, y)$, we say that $x, x' \in \mathcal{X}$ are confusable if there
is a $y \in \mathcal{Y}$ such that $P(x, y) > 0$ and $P(x', y) > 0$. A characteristic graph $G$
is defined on the vertex set $\mathcal{X}$, and $x, x' \in \mathcal{X}$ are connected by an edge if
they are confusable. The pair $(G, P)$ denotes a probabilistic graph consisting
of $G$ together with the distribution $P$ over its vertices (here $P$ denotes the
marginal on $\mathcal{X}$). We say that two vertices $(x, x')$ are adjacent if there is an edge
that connects them in $G$. The chromatic number of $G$, $\chi(G)$, is defined to be
the smallest number of colors needed to color the vertices of $G$ so that no two
adjacent vertices share the same color.
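
The definitions above can be illustrated with a minimal Python sketch. It builds the set of confusable pairs from a joint pmf and finds the chromatic number by exhaustive search, which is practical only for very small alphabets. The function names and the cyclic "typewriter" SI model (Y equals X or X+1 modulo the alphabet size) are assumptions made for illustration and are not taken from the paper.

```python
from itertools import combinations, product

def characteristic_graph(p_xy):
    """Return the confusable pairs (x, x') for a joint pmf given as a
    dict {(x, y): probability}."""
    by_y = {}
    for (x, y), p in p_xy.items():
        if p > 0:
            by_y.setdefault(y, set()).add(x)
    edges = set()
    for xs in by_y.values():
        # every two source symbols sharing an SI symbol are confusable
        for a, b in combinations(sorted(xs), 2):
            edges.add((a, b))
    return edges

def chromatic_number(vertices, edges):
    """Smallest k admitting a proper k-coloring (exhaustive search)."""
    vertices = list(vertices)
    for k in range(1, len(vertices) + 1):
        for colors in product(range(k), repeat=len(vertices)):
            color = dict(zip(vertices, colors))
            if all(color[a] != color[b] for a, b in edges):
                return k
    return 0  # empty vertex set

def typewriter(n):
    """Hypothetical 'typewriter' SI: Y is X or X+1 (mod n), uniformly."""
    return {(x, (x + d) % n): 1 / (2 * n) for x in range(n) for d in (0, 1)}

for n in (5, 6):
    edges = characteristic_graph(typewriter(n))
    print(n, sorted(edges), chromatic_number(range(n), edges))
```

Under this assumed SI model the characteristic graph is a cycle on the source alphabet, so an odd alphabet size needs three colors while an even one needs two.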

We focus only on $(x, y)$ pairs with $P(x, y) > 0$ and thus restrict attention only
to restricted inputs (RI) protocols, as defined in [2]. A protocol for transmitting
$X$ when the decoder knows $Y$, henceforth referred to as an RI protocol, is
defined to be a mapping $\phi : \mathcal{X} \to \{0, 1\}^*$ such that if $x$ and $x'$ are confusable
then $\phi(x)$ is neither equal to, nor a prefix of, $\phi(x')$. An encoder that uses an RI
protocol will be referred to as an SI-aware encoder. The length in bits of $\phi(x)$
will be denoted by $|\phi(x)|$. Note that for restricted inputs, the prefix condition
should be kept only over edges of $G$. Namely, for every $y \in \mathcal{Y}$ the prefix
condition should be kept over the subset $\{x : P(x, y) > 0\}$. The fact that the
same $x \in \mathcal{X}$ can be contained in multiple such subsets, but can have only a
single bit representation, complicates the search for the optimal RI protocol.
Let $l_Y(\phi) = \sum_{x \in \mathcal{X}} P(x)|\phi(x)|$, where the subscript emphasizes that $Y$ is known
to the decoder. Let

$$L_Y(X) = \min\{l_Y(\phi) : \phi \text{ is an RI protocol}\}. \qquad (1)$$

In degenerate cases, where given an SI symbol there is only one possible source
symbol with positive probability, no bits need to be sent and we set
$|\phi(x)| = 0$. In this case, the decoder knows that the next message in the
bit-stream will be that of the next source symbol, so synchronization will be
maintained. Upper and lower bounds on $L_Y(X)$, in terms of the entropy of the
optimal coloring, are given in [2]. There is no known closed-form expression for
$L_Y(X)$. We will use $L_Y(X)$ as a figure of merit and our results will be
single-letter expressions in terms of $L_Y(\cdot)$. In Fig. 1, we give an example of bipartite
graphs, formed by two joint distributions $P(x, y)$ where an edge connects $(x, y)$
if $P(x, y) > 0$, along with the characteristic graphs and the optimal RI protocols
for a uniform $P_X(\cdot)$.




Figure 1: Example of bipartite graphs of $P(x, y)$ along with their associated
characteristic graphs $G$ and an RI protocol for 5-letter (a) and 6-letter (b) alphabets
with "typewriter" SI.

In Fig. 1a, we used 4 different bit representations for the source symbols.
These bit representations are not prefix-free, but are easily seen to be uniquely
decodable with the SI. The optimal bit representations imply a 4-coloring scheme
for $G$, although $\chi(G) = 3$. In Fig. 1b, however, $\chi(G) = 2$ and indeed the
optimal RI protocol uses a 2-coloring scheme. In Section 3, we will return to
this example, as well as another example where increasing the quantizer output
alphabet reduces the rate.
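
The gap between the number of codewords and the chromatic number can be checked numerically. The following is a minimal brute-force sketch in Python, assuming a cyclic typewriter SI (Y equals X or X+1 modulo the alphabet size), a uniform source, and a cap of two bits on the codeword length. The helper names and the length cap are assumptions for illustration; the search is exponential and only meant for toy alphabets.

```python
from itertools import product

def typewriter_edges(n):
    """Confusable pairs for the assumed typewriter SI: x ~ x+1 (mod n)."""
    return [(x, (x + 1) % n) for x in range(n)]

def prefix_related(a, b):
    """True if a equals b or one of them is a prefix of the other."""
    return a.startswith(b) or b.startswith(a)

def best_ri_protocol(probs, edges, max_len=2):
    """Exhaustively search codeword assignments with lengths <= max_len
    (an assumed cap) and return (average length, assignment)."""
    words = [""] + ["".join(bits) for k in range(1, max_len + 1)
                    for bits in product("01", repeat=k)]
    best = (float("inf"), None)
    for phi in product(words, repeat=len(probs)):
        # the prefix condition is required only across edges of G
        if any(prefix_related(phi[a], phi[b]) for a, b in edges):
            continue
        avg = sum(p * len(w) for p, w in zip(probs, phi))
        if avg < best[0]:
            best = (avg, phi)
    return best

for n in (5, 6):
    avg, phi = best_ri_protocol([1 / n] * n, typewriter_edges(n))
    print(n, round(avg, 3), phi, "distinct codewords:", len(set(phi)))
```

For the 5-letter case the search returns an assignment with four distinct codewords, one of which is a prefix of two others, even though three colors suffice to color the graph. For the 6-letter case two one-bit codewords are enough.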

When the graph $G$ is complete, the prefix condition should be kept for all
$x \in \mathcal{X}$, thus reducing the RI protocol to regular prefix coding. In this case,
$L_Y(X)$ is equal to the average Huffman codeword length of $X$. This is, of
course, also true when SI is not available at the decoder, and we can think of
a bipartite graph where all elements of $\mathcal{X}$ are connected to the same constant
$y$. The average Huffman length of the source $X$ will be denoted by $L(X)$.

In the proofs of the converses of our theorems, we use a "genie" that reveals
common information to both encoder and decoder; thus we define a conditional
RI protocol. Let the triplet $(X, Y, Z)$ be distributed with some joint distribution
$P(x, y, z)$. The information that is known to both parties will be denoted by $Z$,
while $X, Y$ continue to play the roles of source output and the SI, respectively.
For any $z \in \mathcal{Z}$, let $\Phi(z)$ denote the set of conditional RI protocols for $z$. Namely,
$\Phi(z)$ is the set of all RI protocols for the source $X$ and SI $Y$ restricted to pairs $(x, y)$
with $P(x, y, z) > 0$. For any $\phi \in \Phi(z)$, let $l_Y(\phi|z) = \sum_{x \in \mathcal{X}} P(x|z)|\phi(x)|$ denote
the average length when $Z = z$. Similarly, let

$$L_Y(X|Z = z) = \min\{l_Y(\phi|z) : \phi \in \Phi(z)\}. \qquad (2)$$

Finally, let $L_Y(X|Z) = E\,L_Y(X|Z = z)$, where the expectation is with respect
to $P_Z(\cdot)$ and we used the same abuse of notation which is commonly used with
the notation of conditional entropy. It follows that $L_Y(X|Z) \le L_Y(X)$ since
the set of RI protocols which are valid without the common knowledge of $Z$
is contained in the set of conditional RI protocols which are valid when $Z$ is
known at both ends. In the special case where $Z = Y$, i.e., the SI is known to
both parties, the RI protocol for each $y$ reduces to designing a Huffman code
according to $P_{X|Y}(\cdot|y)$ for every $y \in \mathcal{Y}$. We will denote the conditional Huffman
length by $L(X|Z)$ and it is given by

$$L(X|Z) = \sum_{z} P(z) \min_{l} \sum_{x} P(x|z)\, l(x), \qquad (3)$$

where in the minimization we consider all length functions $l(\cdot)$ on $\mathcal{X}$ that satisfy
Kraft's inequality.
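
As a small numerical companion to (3), the sketch below computes the average Huffman length $L(X)$ and the conditional Huffman length $L(X|Z)$ in Python. It relies on the standard fact that a binary Huffman code minimizes the expected length over integer lengths satisfying Kraft's inequality, and that its average length equals the sum of the merged probabilities. The function names and the toy distribution are assumptions for illustration.

```python
import heapq

def huffman_avg_length(probs):
    """Average codeword length of a binary Huffman code for the pmf `probs`.
    A single-symbol (degenerate) source needs zero bits."""
    heap = [p for p in probs if p > 0]
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged  # each merge adds one bit to every leaf below it
        heapq.heappush(heap, merged)
    return total

def conditional_huffman_length(p_xz):
    """L(X|Z): a separate Huffman code for P(.|z), averaged over P(z).
    The joint pmf is given as a dict {(x, z): probability}."""
    p_z = {}
    for (_, z), p in p_xz.items():
        p_z[z] = p_z.get(z, 0.0) + p
    total = 0.0
    for z, pz in p_z.items():
        if pz > 0:
            cond = [p / pz for (_, zz), p in p_xz.items() if zz == z]
            total += pz * huffman_avg_length(cond)
    return total

# Toy example: X uniform on {0, 1, 2, 3} and Z = X // 2 known at both ends.
p_xz = {(x, x // 2): 0.25 for x in range(4)}
print(huffman_avg_length([0.25] * 4), conditional_huffman_length(p_xz))
```

In this toy case the common knowledge of Z halves the number of candidate symbols, so one bit per symbol suffices instead of two.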
3 Single-User Zero-Delay Problems



We investigate the following zero-delay problem. A DMS emits pairs of
random variables $(X_t, Y_t)$ according to $P(x, y)$. The alphabets of $X_t$ and $Y_t$, $\mathcal{X}$
and $\mathcal{Y}$, as well as all alphabets in the sequel, are finite. At time $t$, an encoder
observes $X_t$ and transmits a compressed codeword, $W_t$, to a decoder which also
observes $Y_t$. The decoder produces $\hat{X}_t \in \hat{\mathcal{X}}$, a reproduction of $X_t$, where $\hat{\mathcal{X}}$
is the reproduction alphabet. Given a constant $D$ and a distortion measure
$d : \mathcal{X} \times \hat{\mathcal{X}} \to \mathbb{R}^+$, it is required that $\limsup_{n \to \infty} \frac{1}{n} E \sum_{t=1}^{n} d(X_t, \hat{X}_t) \le D$.
Operation is in real-time. This means that the transmitted data, $W_t$, can be
a function only of the encoder's observations no later than time $t$, namely, $X^t$.
Similarly, the decoder's estimate, $\hat{X}_t$, is a function of $(W^t, Y^t)$. Let $L_n$ denote
the total number of bits sent after observing $n$ source symbols. The rate of
the encoder is defined by $R = \limsup_{n \to \infty} \frac{1}{n} E L_n$. Our goal is to find the
tradeoffs between $R$ and $D$. Since no delay is allowed, $W_t$ must be encoded by
an instantaneous code. The model is depicted in Fig. 2.



Figure 2: Basic model (block diagram: Encoder, bit-stream 01001..., Decoder)



In most block coding schemes, there is a probability of failure of the coding 
system (for example, the probability that the input will not be typical in schemes 
based on typicality). In these schemes, this probability becomes small as the 
block length increases and thus effectively does not affect the average distortion. 
In the zero-delay regime, however, there is no equivalent to failure since the 
encoding is sequential. An error at time t will mean the decoder failed to parse 
the bit-stream sent by the encoder correctly. However, such an error event is
catastrophic and will render the encoder's future messages either meaningless or
ambiguous. We therefore restrict our model to one where synchronization is
maintained between the encoder and decoder at all times. This means that the
encoder messages, $W_t$, are sent without error. Note that in general
$P(w_t, y_t) \neq P(w_t)P(y_t)$ (since $(X_t, Y_t)$ are not independent) and therefore, we will need
to consider instantaneous coding of $W_t$ in the presence of correlated SI at the
decoder.
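
To make the synchronization requirement concrete, here is a minimal Python sketch of sequential parsing with SI. The codeword table is one valid RI protocol for an assumed 5-letter typewriter SI (Y equals X or X+1 modulo 5); the table, the helper names, and the sample sequences are illustrative assumptions rather than the protocol of Fig. 1.

```python
def encode(xs, phi):
    """Concatenate the codewords of the source symbols into one bit-stream."""
    return "".join(phi[x] for x in xs)

def decode(bits, ys, phi, support):
    """Parse the bit-stream symbol by symbol. `support[y]` lists the source
    symbols x with P(x, y) > 0; their codewords are prefix-free among
    themselves, so exactly one of them matches at the current position."""
    pos, out = 0, []
    for y in ys:
        matches = [x for x in support[y] if bits.startswith(phi[x], pos)]
        if len(matches) != 1:
            raise ValueError("synchronization lost")
        out.append(matches[0])
        pos += len(phi[matches[0]])
    return out

# Assumed 5-letter typewriter SI: Y is X or X+1 (mod 5).
support = {y: [y, (y - 1) % 5] for y in range(5)}
# The table is not prefix-free: "1" is a prefix of both "10" and "11".
phi = {0: "0", 1: "1", 2: "0", 3: "10", 4: "11"}

xs = [3, 1, 4, 0, 2, 3]
ys = [3, 2, 0, 0, 3, 4]  # each y equals x or x+1 (mod 5)
bits = encode(xs, phi)
print(bits, decode(bits, ys, phi, support))
```

The decoder never looks past the current codeword, so each symbol is recovered before the next message starts, which is the instantaneous-code property required by the model.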
We start with a fully zero-delay system and give the rate-distortion function
for this system. Later, we relax the zero-delay constraint on the SI and allow
lookahead, showing that there is no performance gain. Furthermore, we give
the optimal performance when one of the components (encoder or decoder) is
forced to be scalar, while no constraints are imposed on the other component
(decoder or encoder). Subsection 3.1 summarizes the main results and discusses
the results of this section. Proofs of Theorems 2 and 3 are given in Subsections
3.2 and 3.3, respectively. Examples of SI-aware quantizers that result from these
theorems are given in Subsection 3.4. Finally, in Subsection 3.5, we revisit the
causal setting as treated in [8] and [9] and derive the equivalent of Theorem 2
for the causal setting, which was left open in [9].



3.1 Main Results 

A pair $(R, D)$ is said to be achievable if there exists a rate-$R$ encoder with causal
encoding functions $W_t = f_t(X^t)$, $t = 1, 2, \ldots$, and a decoder with causal reproduction
functions, $\hat{X}_t = g_t(W^t, Y^t)$, such that the average distortion is smaller
than $D$. Let $\mathcal{R}_{ZD}(D)$ denote the infimum over all rates that are achievable with
a given $D$, where the subscript stands for zero-delay. Let

$$R_{ZD}(D) = \min_{h, f} L_Y(f(X)), \qquad (4)$$

where the minimization is over all deterministic functions $h : \mathcal{Z} \times \mathcal{Y} \to \hat{\mathcal{X}}$ and
$f : \mathcal{X} \to \mathcal{Z}$ such that $E\,d(X, h(Y, Z)) \le D$ with $Z = f(X)$ (obviously, $|\mathcal{Z}| \le |\mathcal{X}|$). Finally,
denote the lower convex envelope of $R_{ZD}(D)$ by $\underline{R}_{ZD}(D)$.
The first result of this paper is the following theorem:

Theorem 1.

$$\mathcal{R}_{ZD}(D) = \underline{R}_{ZD}(D). \qquad (5)$$

This theorem is proved along with Theorem 2 in Subsection 3.2.
Remarks: 

1. Theorem 1 implies that optimal performance is attained by time-sharing at
most two scalar SI-aware quantizers along with scalar decoders. The role of the
function $f$ is to partition the source alphabet into subsets. Note that there is no
sense in creating overlapping subsets since it will only increase the uncertainty
at the decoder (increase the distortion) while adding edges to the characteristic
graph of $f(X)$ with $Y$ (thus increasing the rate). For discrete $X$, there is a
finite number of such partitions. For every possible partition, every choice of a
possible $h$ will define a point in the $R$-$D$ plane. This point will be calculated
by finding the best RI protocol for the partition defined by $f$ and calculating the
average distortion with $(f, h)$. Since there is a finite number of such $f, h$, there
is a finite number of points on the $R$-$D$ plane. The lower convex envelope
of these points gives us $\underline{R}_{ZD}(D)$, which will be piecewise linear.

2. For a given $D$, the partition of the source alphabet with the minimal
number of subsets which achieves distortion $D$ is not necessarily optimal and
increasing the number of subsets (thus enlarging the quantizer's alphabet size)
can reduce the average rate. To see this, assume that we have found $f, h$ where
$f$ partitions the alphabet into the minimal number of subsets for this specific $D$.
If at least one of these subsets contains more than one element, we can always
find $f'$ and $h'$ such that $f'$ will divide this subset into two disjoint subsets,
while $h'$ will be defined the same as $h$ for the original subset and therefore
$f', h'$ will have the same average distortion as the original $f, h$. While it seems
intuitive that the rate needed for $f'$ will be higher than the rate needed for
$f$, this is not necessarily true. The reason is that the characteristic graph of
$f(X)$ is fully connected, while this is not always the case for the characteristic
graph of $f'(X)$. To see why the characteristic graph of $f(X)$ is fully connected,
note that if two subsets are not connected in the characteristic graph of $f(X)$,
then they are not confusable. Therefore, we can combine these subsets into
one subset without loss of performance since $h$ can act differently for the $Y$'s
associated with the original subsets. But this contradicts the assumption that
$f$ was the partition with the minimal number of subsets for the given $D$. If
the characteristic graph of $f'(X)$ is not fully connected, the SI can be used to
alleviate the prefix conditions on the code for $f'(X)$ and shorten its length.
Therefore, increasing the output size of the quantizer might reduce the average
length. We demonstrate such a phenomenon in Section 3.4.3.



3. There is no loss of generality in the restriction to deterministic encoders
since $L_Y(Z)$ is a concave functional of $\{P(z|x)\}$ while the distortion is linear in
$\{P(z|x)\}$. Therefore, optimizing over the whole convex set of stochastic encoders
(represented by distributions $\{P(z|x)\}$) is equivalent to optimizing only over the
extreme points of this set, which are the deterministic encoders.

4. The result of this theorem relies heavily on the fact that no delay or encoder
lookahead is allowed. If, for example, the encoder is allowed a lookahead
of one symbol, i.e., $W_t = f(X^{t+1})$, then one can think of the problem as
zero-delay coding of a Markov source, with symbols $\tilde{X}_t = (X_t, X_{t+1})$. The results
presented in [12] suggest that the optimal encoder in this case should use $\tilde{X}_t$
and its previous message.¹ While [12] does not rule out the existence of simpler
optimal encoders, it seems intuitive that for Markov sources the encoder should
not be scalar.

¹Although the result of [12] holds for a cost function which is a linear combination of the
distortion and the average length, it can be adapted to the constrained setting as well when
assuming that there is common randomness between encoder and decoder; see also [13] for an
example.
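
The procedure outlined in Remark 1 can be sketched in Python for a toy source. The sketch enumerates all partitions $f$, pairs each with the distortion-minimizing $h$ (any other $h$ gives a point with the same rate and no less distortion), finds the rate by brute force over codewords of at most three bits, and takes the lower convex envelope. The function names, the length cap, the 4-letter typewriter SI, and the Hamming distortion are assumptions for illustration only.

```python
from itertools import product

def set_partitions(items):
    """Yield every partition of the list `items` as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part

def ri_length(p_zy, n_cells, max_len=3):
    """Brute-force L_Y(Z) for a pmf {(z, y): prob}, z in range(n_cells);
    codewords are capped at max_len bits (an assumed cap)."""
    p_z = [sum(p for (z, _), p in p_zy.items() if z == c) for c in range(n_cells)]
    ys = {y for _, y in p_zy}
    edges = [(a, b) for a in range(n_cells) for b in range(a)
             if any((a, y) in p_zy and (b, y) in p_zy for y in ys)]
    words = [""] + ["".join(w) for k in range(1, max_len + 1)
                    for w in product("01", repeat=k)]
    best = float("inf")
    for phi in product(words, repeat=n_cells):
        if all(not (phi[a].startswith(phi[b]) or phi[b].startswith(phi[a]))
               for a, b in edges):
            best = min(best, sum(p * len(w) for p, w in zip(p_z, phi)))
    return best

def rd_points(p_xy, xs, x_hats, d):
    """One (D, R) point per partition f, using the distortion-minimizing h."""
    points = []
    for part in set_partitions(list(xs)):
        cell = {x: c for c, block in enumerate(part) for x in block}
        p_zy, cost = {}, {}
        for (x, y), p in p_xy.items():
            if p > 0:
                key = (cell[x], y)
                p_zy[key] = p_zy.get(key, 0.0) + p
                for xh in x_hats:
                    cost[key, xh] = cost.get((key, xh), 0.0) + p * d(x, xh)
        dist = sum(min(cost[key, xh] for xh in x_hats) for key in p_zy)
        points.append((dist, ri_length(p_zy, len(part))))
    return points

def lower_convex_envelope(points):
    """Vertices of the lower convex envelope of the (D, R) points, kept
    only up to the minimum-rate vertex (rate is non-increasing in D)."""
    best = {}
    for dist, rate in points:
        best[dist] = min(rate, best.get(dist, float("inf")))
    hull = []
    for p in sorted(best.items()):
        while len(hull) >= 2 and (
                (hull[-1][0] - hull[-2][0]) * (p[1] - hull[-2][1])
                - (hull[-1][1] - hull[-2][1]) * (p[0] - hull[-2][0])) <= 0:
            hull.pop()
        hull.append(p)
    low = min(range(len(hull)), key=lambda i: hull[i][1])
    return hull[:low + 1]

n = 4  # assumed toy source: X uniform on {0,...,3}, Y is X or X+1 (mod 4)
p_xy = {(x, (x + s) % n): 1 / (2 * n) for x in range(n) for s in (0, 1)}
hamming = lambda x, xh: 0.0 if x == xh else 1.0
points = rd_points(p_xy, range(n), range(n), hamming)
print(lower_convex_envelope(points))
```

In this toy case the envelope has only two vertices, so any intermediate distortion is served by time-sharing a lossless one-bit scalar codec with a codec that sends nothing, which is the two-point time-sharing structure described in Remark 1.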

Let $\mathcal{R}_{ZD}^{la}(D)$ denote the infimum over all rates that are achievable with a
given $D$, with the same encoders as before and decoders that can use the whole
SI sequence, i.e., $\hat{X}_t = g_t(W^t, Y^n)$. We have the following theorem:

Theorem 2.

$$\mathcal{R}_{ZD}^{la}(D) = \underline{R}_{ZD}(D). \qquad (6)$$

We prove Theorem 2 in Subsection 3.2. Theorem 1 is a special case of Theorem
2 and the direct part of the proof, as apparent from the theorems, is the same
for both theorems.

The theorem states that allowing the decoder to observe the future SI symbols
will not result in any performance gain. This is in contrast to the setting
of non-causal access to the source and causal access to the SI at the decoder,
treated in [10], where it was shown that SI look-ahead can improve performance.

Now assume that the encoder, in addition to the source $X_t$, observes causally
another process, $S_t \in \mathcal{S}$, which is jointly distributed with the source and SI
according to $P(s^n, x^n, y^n) = \prod_{t=1}^{n} P(s_t, x_t, y_t)$, i.e.,
the encoder has access to SI as well as the source. The goal remains to
reconstruct only the source $X_t$ with distortion not exceeding $D$. Let us define
$\mathcal{R}_{ZD,EncSI}(D)$ to be the infimum over all rates that are achievable with a given
$D$, where the encoders are causal functions $W_t = f_t(S^t, X^t)$ and the decoders
are as in Theorem 2, i.e., causal in $\{W_t\}$ with access to the whole SI. Let us define

$$R_{ZD,EncSI}(D) = \min_{h, f} L_Y(f(X, S)), \qquad (7)$$

where the minimization is over all deterministic functions $h : \mathcal{Z} \times \mathcal{Y} \to \hat{\mathcal{X}}$ and
$f : \mathcal{X} \times \mathcal{S} \to \mathcal{Z}$ such that $E\,d(X, h(Y, Z)) \le D$, and denote its lower convex
envelope by $\underline{R}_{ZD,EncSI}(D)$. We have the following corollary,
which follows from Theorem 2 and is proved in exactly the same way.

Corollary 1.

$$\mathcal{R}_{ZD,EncSI}(D) = \underline{R}_{ZD,EncSI}(D). \qquad (8)$$

Unlike the case without the SI, where causal encoder SI cannot help (since
it can be simulated by the encoder even if it is not available), when the decoder
has SI, encoder SI indeed helps and can be viewed as extra encoder insight on
the decoder SI. In the extreme case when $S_t = Y_t$, we obtain back the results of
[9], adapted to the zero-delay setting. Note that with $S$ at the encoder, $(x, x')$
will be confusable if there exist $(y, s)$ such that $P(s, x, y) > 0$ and $P(s, x', y) > 0$,
and thus, even if no distortion is allowed, $S$ can be used to reduce the number
of edges in the characteristic graph $G$ and thus reduce the constraints on the
RI protocol, achieving a lower average rate.

From Theorems 1 and 2, it is apparent that as long as the encoding function
is causal in the source, and the decoder is causal in the encoder messages, scalar
pairs of encoders and decoders (codecs) are optimal. This suggests, as stated in
the Introduction, that the optimal performance achievable by zero-delay systems
is by far inferior to the performance of systems that allow arbitrary delay. The
inquisitive reader might ask at this point which of the three constraints imposed
by our zero-delay model (causal encoding functions / causal decoding functions
/ sequential messages with instantaneous code) is to be "blamed" for the rather
poor performance given by Theorems 1 and 2. Had we alleviated one of these
constraints, could optimal rate-distortion performance be achieved?

We now show by example that at least without SI at the decoder, there 
are sources and distortion measures for which the answer to the last question 
is affirmative. To see this, think of a system with an encoder that observes 
the whole sequence X". The encoder can send no more than logjA"! bits per 
transmission and the decoders are causal (sequential) in the encoder's messages,
meaning that $\hat{X}_t$ is calculated using the data received in the first $t$ encoder
transmissions. Compared to the systems described above, we have only relaxed
the constraint on the encoder. Note that we have to restrict the number of bits
the encoder can send in each transmission; otherwise, the sequential-decoder
requirement is meaningless, since the encoder could send the description of $X^n$
in a single transmission, as done in block-coding schemes. Such a scheme can
be extremely useful for streaming purposes, where the whole stream is known in
advance to the encoder and it should be streamed to a decoder without decoding
delay. Our scheme works as follows: A classical rate-distortion random code [13]
is used. The encoder finds the first codeword $\hat{X}^n$ in the codebook which is
distortion-typical (as defined in [14]) and starts to transmit the reproduction
symbols, $\hat{X}_t$, sequentially, using $\log|\hat{\mathcal{X}}|$ bits each transmission (in contrast to
the classical encoder, which sends the index of the codeword in a block). The
decoder outputs $\hat{X}_t$ as it receives it. The idea here is that after receiving enough
symbols, the decoder can identify the specific codeword in the codebook since,
with high probability, no other codeword has the same prefix, and the rest
of the reproduction symbols can be reproduced without further transmissions
from the encoder. We show in Appendix A that such a scheme can achieve
optimal rate-distortion performance ($R(D)$ bits per source sample) in some cases
and is less than one bit away from optimality in the remaining cases. Tree codes
are another example of sequential schemes (see [15] and references therein), but
there, the fact that the rate is constrained to be the natural logarithm of an
integer usually forces the schemes to work in small blocks, unlike the scheme
presented above. The point we make here is that, at least for some combinations
of sources and distortion functions, imposing causality constraints only on the
decoders and using instantaneous codes does not affect the performance.
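
The prefix-identification idea behind the sequential scheme can be illustrated with a small simulation. The sketch below is not the construction analyzed in the appendix: it assumes a codebook drawn uniformly at random (which corresponds only to the case of a uniform reproduction prior), skips the distortion-typicality search, and simply measures how many leading symbols of one codeword must be seen before no other codeword shares the prefix. The function names `symbols_until_unique` and `simulate`, and all parameter values, are hypothetical.

```python
import math
import random


def symbols_until_unique(codebook, index):
    """Number of leading symbols of codebook[index] the decoder must see
    before no other codeword shares that prefix."""
    target = codebook[index]
    rivals = [c for i, c in enumerate(codebook) if i != index]
    for k, symbol in enumerate(target, start=1):
        # keep only the rivals that still agree with the transmitted prefix
        rivals = [c for c in rivals if c[k - 1] == symbol]
        if not rivals:
            return k
    # an identical codeword exists, so the prefix never becomes unique
    return len(target)


def simulate(n=40, rate=0.25, alphabet_size=2, seed=0):
    """Draw a uniform random codebook (an illustrative assumption) and
    return (symbols sent, bits per source symbol) for its first codeword."""
    rng = random.Random(seed)
    size = 2 ** round(n * rate)
    codebook = [tuple(rng.randrange(alphabet_size) for _ in range(n))
                for _ in range(size)]
    k = symbols_until_unique(codebook, 0)
    return k, k * math.log2(alphabet_size) / n
```

Calling `simulate()` with a larger `n` at a fixed `rate` gives a rough feel for how the number of transmitted symbols compares with `n * rate`; the exact statement and its conditions are those of the appendix, not of this toy experiment.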

Next, we ask what is the best attainable performance if we are constrained to 
use a scalar decoder/encoder, when the other component (encoder/decoder) is 
unconstrained. For example, we have a scalar encoder, but the decoder can wait
until it has received all of the encoder's messages and only then output $\hat{X}^n$,
or vice versa. Such a limitation
is motivated by practical constraints imposed on the encoding/decoding
devices, for example, the encoder is a simple sensor but the decoder which re- 
ceives data from another sensor as SI can be as complex as needed for optimal 
performance. Note that for a scalar decoder, although we allow a non-causal 
encoder, we still require that the encoder will send a message every time in- 
stance and the decoder will reconstruct Xt according to this message and SI. 
Can the simplicity of one of the elements be compensated by the complexity 
of the other? The next theorem will assert that the answer to this question is 
negative. 

Let $\mathcal{R}_{s\text{-}d}(D)$ denote the infimum over all rates that are achievable with a
given $D$, when non-causal encoders are allowed, i.e., $W_t = f_t(X^n)$, but the
decoders are restricted to be scalar in $W_t$, i.e., $\hat{X}_t = g_t(W_t, Y^n)$ (the $s$-$d$ subscript
stands for scalar decoder). Similarly, let $\mathcal{R}_{s\text{-}e}(D)$ denote the infimum over all
rates that are achievable with a given $D$, when non-causal decoders are allowed,
i.e., $\hat{X}_t = g_t(W^n, Y^n)$, but the encoders are restricted to be scalar in $X_t$, i.e.,
$W_t = f_t(X_t)$. We have the following theorem:

Theorem 3.

$$\mathcal{R}_{s\text{-}d}(D) = \mathcal{R}_{s\text{-}e}(D) = \underline{R}_{ZD}(D). \tag{9}$$

Theorem 3 states that when either the encoder or the decoder is constrained
to be scalar, the other side can be scalar as well without any performance loss.
Note that for the scalar decoder case, this is true even if the decoder is scalar 
only in the encoder's messages but has full SI look-ahead. This theorem extends 
Theorem 5 of [inj to include variable rate, look-ahead and SI. 

The next two subsections will be devoted to the proofs of Theorem 2 and of the
first part of Theorem 3, namely $\mathcal{R}_{s\text{-}d}(D) = \underline{R}_{ZD}(D)$. The proof that
$\mathcal{R}_{s\text{-}e}(D) = \underline{R}_{ZD}(D)$, which follows the same lines, is deferred to Appendix B.
Examples are given in Subsection 3.4. Finally, the causal setting will be treated
in Section 3.5.

3.2 Proof of Theorem 2

3.2.1 Converse part

At every stage, the encoder sends a message $W_t$. This message is in general a
function of $X^t$. The decoder at time $t$ has already received $W^{t-1}$ and has access
to $Y^n$. Only the current and past SI, $Y^t$, serves as SI when sending $W_t$ since
$Y_{t+1}^n$ is independent of $X^t$. Therefore, we have

$$nR \ge \sum_{t=1}^{n} L_{Y^t}(W_t \mid W^{t-1}). \tag{10}$$

To avoid the complex SI structure, which depends on the time instance $t$, we use
a genie-aided scheme. At each time instant, a genie reveals all past SI symbols
to the encoder and all past source symbols to the decoder. With this "genie
aided" feedback and feed-forward, at each time instant, only $Y_t$ serves as SI
which is not known to both parties. Therefore, the minimal average number of
transmitted bits at each stage is lower bounded by $L_{Y_t}(W_t \mid X^{t-1}, Y^{t-1})$. For
any sequence of encoders which are functions of $(X^t, Y^{t-1})$ and any sequence
of reproduction decoders which are functions of $(W_t, X^{t-1}, Y^n)$, satisfying the
distortion constraint, we have:

\begin{align}
nR &\ge \sum_{t=1}^{n} L_{Y_t}(W_t \mid X^{t-1}, Y^{t-1}) \notag \\
&\ge \sum_{t=1}^{n} L_{Y_t}(W_t \mid X^{t-1}, Y^{t-1}, Y_{t+1}^n) \tag{11} \\
&= \sum_{t=1}^{n} \int L_{Y_t}(W_t \mid x^{t-1}, y^{t-1}, y_{t+1}^n)\, d\mu(x^{t-1}, y^{t-1}, y_{t+1}^n) \notag \\
&= \sum_{t=1}^{n} \int L_{Y_t}\big(f_t(X_t, x^{t-1}, y^{t-1}) \mid x^{t-1}, y^{t-1}, y_{t+1}^n\big)\, d\mu(x^{t-1}, y^{t-1}, y_{t+1}^n) \notag \\
&= \sum_{t=1}^{n} \int L_{Y_t}\big(f_t(X_t, x^{t-1}, y^{t-1})\big)\, d\mu(x^{t-1}, y^{t-1}, y_{t+1}^n) \tag{12}
\end{align}

where in (11) we used the fact that conditioning reduces the average length. In
the next line, $\mu(\cdot)$ denotes the joint probability mass function of its arguments,
and the last equality is true since $X_t$ is independent of $(X^{t-1}, Y^{t-1}, Y_{t+1}^n)$. Now,
$f_t(X_t, x^{t-1}, y^{t-1})$ can be seen as a specific choice of $f(X_t)$ in the definition of
$R_{ZD}(D)$. This, along with the fact that we know that $Y^{t-1} = y^{t-1}$,
$Y_{t+1}^n = y_{t+1}^n$ and $X^{t-1} = x^{t-1}$, makes the decoding function
$\hat{X}_t = g_t(f_t(X_t, x^{t-1}, y^{t-1}), x^{t-1}, y^{t-1}, Y_t, y_{t+1}^n)$ a specific choice of $h(\cdot,\cdot)$ in the
definition of $R_{ZD}(D)$. We therefore have

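\begin{align}
nR &\ge \sum_{t=1}^{n} \int L_{Y_t}\big(f_t(X_t, x^{t-1}, y^{t-1})\big)\, d\mu(x^{t-1}, y^{t-1}, y_{t+1}^n) \notag \\
&\ge \sum_{t=1}^{n} \int R_{ZD}\Big( E\big[ d\big(X_t, g_t(f_t(X_t, x^{t-1}, y^{t-1}), x^{t-1}, y^{t-1}, Y_t, y_{t+1}^n)\big) \,\big|\, x^{t-1}, y^{t-1}, y_{t+1}^n \big] \Big)\, d\mu(x^{t-1}, y^{t-1}, y_{t+1}^n) \tag{13} \\
&\ge \sum_{t=1}^{n} \int \underline{R}_{ZD}\Big( E\big[ d\big(X_t, g_t(f_t(X_t, x^{t-1}, y^{t-1}), x^{t-1}, y^{t-1}, Y_t, y_{t+1}^n)\big) \,\big|\, x^{t-1}, y^{t-1}, y_{t+1}^n \big] \Big)\, d\mu(x^{t-1}, y^{t-1}, y_{t+1}^n) \tag{14} \\
&\ge \sum_{t=1}^{n} \underline{R}_{ZD}\Big( \int E\big[ d\big(X_t, g_t(f_t(X_t, x^{t-1}, y^{t-1}), x^{t-1}, y^{t-1}, Y_t, y_{t+1}^n)\big) \,\big|\, x^{t-1}, y^{t-1}, y_{t+1}^n \big]\, d\mu(x^{t-1}, y^{t-1}, y_{t+1}^n) \Big) \tag{15} \\
&= \sum_{t=1}^{n} \underline{R}_{ZD}\Big( E\big[ d\big(X_t, g_t(f_t(X^t, Y^{t-1}), X^{t-1}, Y^n)\big) \big] \Big) \notag \\
&= \sum_{t=1}^{n} \underline{R}_{ZD}\big( E[ d(X_t, \hat{X}_t) ] \big) \notag \\
&\ge n\, \underline{R}_{ZD}\Big( \frac{1}{n} \sum_{t=1}^{n} E[ d(X_t, \hat{X}_t) ] \Big) \tag{16} \\
&\ge n\, \underline{R}_{ZD}(D), \tag{17}
\end{align}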

where (13) follows from the definition of $R_{ZD}(D)$ and the discussion following
(12), (14) follows from the definition of $\underline{R}_{ZD}(D)$, and (15) and (16) follow from the
convexity of $\underline{R}_{ZD}(\cdot)$. Finally, (17) follows from the monotonicity of $\underline{R}_{ZD}(\cdot)$.
Combining the above with the direct part given in the next subsection, we have also
shown that feedback of the SI and feed-forward of the source cannot improve
performance here.

3.2.2 Direct part 

The direct part of the theorem is obtained by time-sharing two scalar SI-aware
quantizers. By the definition of $\underline{R}_{ZD}(D)$, there exist $(D_1, D_2, \lambda)$
such that $D = \lambda D_1 + (1-\lambda) D_2$ and $(f_1, h_1)$, $(f_2, h_2)$ that are the achievers of
$R_{ZD}(D_1)$ and $R_{ZD}(D_2)$, respectively, such that $\lambda R_{ZD}(D_1) + (1-\lambda) R_{ZD}(D_2) =
\underline{R}_{ZD}(D)$. Let $\phi_1, \phi_2$ be the optimal protocols for $Z_{1,t} = f_1(X_t)$, $Z_{2,t} = f_2(X_t)$,
respectively. Also, let $k_n \le n$ be a non-decreasing sequence of integers such
that $\lim_{n\to\infty} k_n/n = \lambda$. For every $n$, we use $(f_1, h_1)$ for the first $k_n$ stages and
$(f_2, h_2)$ for the rest of the $n$-block. The resulting $Z_{i,t}$, $i = 1, 2$, are coded with
the optimal protocols $\phi_1$ or $\phi_2$. The average distortion of this scheme is given
by



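\begin{align}
\frac{1}{n} \sum_{t=1}^{n} E\, d\big(X_t, g(Y^t, Z^t)\big)
&= \frac{k_n}{n} E\, d\big(X_t, h_1(Y_t, Z_{1,t})\big) + \frac{n-k_n}{n} E\, d\big(X_t, h_2(Y_t, Z_{2,t})\big) \notag \\
&\le \frac{k_n}{n} D_1 + \frac{n-k_n}{n} D_2 \tag{18}
\end{align}

and therefore, $\lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^{n} E\, d(X_t, g(Y^t, Z^t)) \le D$. The rate of the code is
given by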



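\begin{align}
\frac{1}{n} E L_n &= \frac{1}{n} \sum_{t=1}^{k_n} E\,|\phi_1(Z_{1,t})| + \frac{1}{n} \sum_{t=k_n+1}^{n} E\,|\phi_2(Z_{2,t})| \notag \\
&= \frac{k_n}{n} L_Y(f_1(X)) + \frac{n-k_n}{n} L_Y(f_2(X)). \tag{19}
\end{align}

Therefore,

\begin{align}
R &= \lim_{n\to\infty} \frac{1}{n} E L_n \notag \\
&= \lambda L_Y(f_1(X)) + (1-\lambda) L_Y(f_2(X)) \notag \\
&= \underline{R}_{ZD}(D). \tag{20}
\end{align}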
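
The time-sharing step can be made concrete numerically. The following sketch assumes that a finite list of (distortion, rate) operating points of scalar quantizers is already available; the points used in the usage note are made-up numbers, not values from this paper. It computes the lower convex envelope of these points and, for a target distortion, the time-sharing weight between the two neighbouring vertices, mirroring $D = \lambda D_1 + (1-\lambda) D_2$. The helper names are hypothetical.

```python
from fractions import Fraction


def _cross(o, a, b):
    """Positive when o -> a -> b turns counter-clockwise."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def lower_convex_envelope(points):
    """Vertices of the lower convex envelope of (D, R) operating points,
    sorted by D and cut at the first minimum-rate vertex."""
    best = {}
    for d, r in points:
        best[d] = min(r, best.get(d, r))  # keep the lowest rate per distortion
    hull = []
    for p in sorted(best.items()):
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()  # the middle point lies on or above the chord
        hull.append(p)
    cut = min(range(len(hull)), key=lambda i: hull[i][1])
    return hull[:cut + 1]


def time_share(hull, d):
    """Return (lam, rate) with d = lam*D1 + (1-lam)*D2 on the envelope."""
    if d < hull[0][0]:
        raise ValueError("distortion below the smallest operating point")
    for (d1, r1), (d2, r2) in zip(hull, hull[1:]):
        if d1 <= d <= d2:
            lam = (d2 - d) / (d2 - d1)
            return lam, lam * r1 + (1 - lam) * r2
    return Fraction(0), hull[-1][1]  # beyond the last vertex: use it alone
```

For the hypothetical points $(0, 2)$, $(1/10, 9/5)$, $(1/5, 1)$ and $(3/10, 0)$, only the two extreme points survive on the envelope, and a target distortion of $3/20$ is met by using each of them half of the time, at rate 1. In a block of length $n$, the first quantizer would then be used for roughly $k_n = \lfloor \lambda n \rfloor$ stages.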

3.3 Proof of Theorem 3

In this subsection, we prove the first part of Theorem 3, namely, we show that
$\mathcal{R}_{s\text{-}d}(D) = \underline{R}_{ZD}(D)$. The proof that $\mathcal{R}_{s\text{-}e}(D) = \underline{R}_{ZD}(D)$ follows the same
lines and is deferred to Appendix B. Since the direct part is the same as the
direct part of Theorem 2, we need to prove the converse only.

We use the same genie-aided scheme as in the proof of Theorem 2 and prove
a stronger converse: For any sequence of encoders, with possible access to the
whole source sequence and past SI, $W_t = f_t(X^n, Y^{t-1})$, and any sequence of
decoders with access to the whole SI sequence, past source symbols and the
current encoder output, $\hat{X}_t = g_t(W_t, X^{t-1}, Y^n)$, that achieve average distortion
$D$ (i.e., $\frac{1}{n} \sum_{t=1}^{n} E\, d(X_t, \hat{X}_t) \le D$), we have

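\begin{align}
nR &\ge \sum_{t=1}^{n} L_{Y_t}(W_t \mid X^{t-1}, Y^{t-1}) \notag \\
&\ge \sum_{t=1}^{n} L_{Y_t}(W_t \mid X^{t-1}, X_{t+1}^n, Y^{t-1}, Y_{t+1}^n) \notag \\
&= \sum_{t=1}^{n} \int L_{Y_t}(W_t \mid x^{t-1}, x_{t+1}^n, y^{t-1}, y_{t+1}^n)\, d\mu(x^{t-1}, x_{t+1}^n, y^{t-1}, y_{t+1}^n) \notag \\
&= \sum_{t=1}^{n} \int L_{Y_t}\big(f_t(x^{t-1}, X_t, x_{t+1}^n, y^{t-1})\big)\, d\mu(x^{t-1}, x_{t+1}^n, y^{t-1}, y_{t+1}^n) \tag{21}
\end{align}

where the last step, as before, relied on the memorylessness of the source. Now,
the rate in the inner expression cannot be smaller than the rate of the optimal
scalar system (with all conditioned elements serving as indices of the functions),
which achieves the distortion achieved by the given decoding function,
$h_t(W_t, y^{t-1}, Y_t, y_{t+1}^n)$. Therefore,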

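\begin{align}
nR &\ge \sum_{t=1}^{n} \int L_{Y_t}\big(f_t(x^{t-1}, X_t, x_{t+1}^n, y^{t-1})\big)\, d\mu(x^{t-1}, x_{t+1}^n, y^{t-1}, y_{t+1}^n) \notag \\
&\ge \sum_{t=1}^{n} \int \underline{R}_{ZD}\Big( E\big[ d\big(X_t, g_t(f_t(x^{t-1}, X_t, x_{t+1}^n, y^{t-1}), x^{t-1}, y^{t-1}, Y_t, y_{t+1}^n)\big) \,\big|\, x^{t-1}, x_{t+1}^n, y^{t-1}, y_{t+1}^n \big] \Big)\, d\mu(x^{t-1}, x_{t+1}^n, y^{t-1}, y_{t+1}^n) \notag \\
&\ge \sum_{t=1}^{n} \underline{R}_{ZD}\Big( \int E\big[ d\big(X_t, g_t(f_t(x^{t-1}, X_t, x_{t+1}^n, y^{t-1}), x^{t-1}, y^{t-1}, Y_t, y_{t+1}^n)\big) \,\big|\, x^{t-1}, x_{t+1}^n, y^{t-1}, y_{t+1}^n \big]\, d\mu(x^{t-1}, x_{t+1}^n, y^{t-1}, y_{t+1}^n) \Big) \tag{22} \\
&= \sum_{t=1}^{n} \underline{R}_{ZD}\Big( E\big[ d\big(X_t, g_t(f_t(X^n, Y^{t-1}), X^{t-1}, Y^n)\big) \big] \Big) \notag \\
&\ge n\, \underline{R}_{ZD}\Big( \frac{1}{n} \sum_{t=1}^{n} E[ d(X_t, \hat{X}_t) ] \Big) \tag{23} \\
&\ge n\, \underline{R}_{ZD}(D), \tag{24}
\end{align}

where in (22) and (23) we used the convexity of $\underline{R}_{ZD}(\cdot)$ and in (24) its
monotonicity.

3.4 Examples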



3.4.1 Lossless transmission 

It is interesting to relate the above results to the lossless case which was
considered in [2]. Since $D = 0$ cannot be achieved by time-sharing positive distortions,
we get that

$$L_Y(X) = \min_{f,h:\; h(y, f(x)) = x} L_Y(f(X)). \tag{25}$$

Let $Z = f(X)$. Any $f$ which is a coloring of the characteristic graph $G$ and $h$,
which is the mapping from color and $y$ back to $x$, are valid candidates in the
optimization problem of (25). If $f$ is not a valid coloring, meaning that two
connected $x_1, x_2$ will result in the same $z$, then there is no $h$ which can result
in zero error. In essence, we are looking for the coloring for which the restricted
inputs protocol will produce the smallest rate. Note that when searching for the
best coloring, our performance will be affected only by the characteristic graph
$G_Z$ which will be built with the "source" $(Z,Y)$. If $f$ is a minimal coloring, i.e.,
$z \in \{1, 2, \ldots, \chi(G)\}$, then $G_Z$ is complete. To see this, note that for a minimal
coloring, if $z_1$ and $z_2$ are not connected, then these colors can be combined
and this reduces the number of colors, contradicting the fact that $f(x)$ is a
minimal coloring. Remember that a complete graph reduces the RI protocol to
Huffman coding. This means that the SI is not helping us to code the colors.
Therefore, looking for colorings which will induce an incomplete $G_Z$ (i.e.,
non-optimal coloring) will allow us to use the SI not only to reduce the alphabet
of the encoder output, but also for the coding of its output (namely, to relax the
prefix condition that is imposed on the codewords when the graph is complete).
In the example of Fig. 1, we used a 4-coloring scheme (we had 4 different bit
representations for the vertices of $G$) and not the optimal 3-coloring. Indeed, $G_Z$
for the 4-coloring is not complete. For a uniform source, we get an average rate
of 1.4 with the 4-coloring, whereas a 3-coloring would give an average rate of 1.6.



3.4.2 Uniform source, fully connected SI model 

Let the reconstruction alphabet be the same as the source alphabet. We use the
Hamming distortion measure ($d(x,\hat{x}) = 0$ if $x = \hat{x}$ and $d(x,\hat{x}) = 1$ otherwise).
The encoder partitions the source alphabet into disjoint subsets $A_1, A_2, \ldots, A_k$,
$k \le |\mathcal{X}|$. When the encoder observes a new source symbol, $x$, it sends the
index of the subset containing $x$, using an RI protocol, as defined in Section 2.
With the Hamming distortion measure, the average distortion is equal to the
probability of error. Therefore, the optimal decoder is the maximum likelihood
decoder, namely:

$$\hat{x} = \arg\max_{x} P(y, z \mid x) = \arg\max_{x \in A_z} P(y \mid x),$$

where $z$ is the subset index sent by the encoder.

Let $|\mathcal{X}| = |\mathcal{Y}| = M$ and, for a small constant $p$, let $P(X = a) = \frac{1}{M}$,
$P(y = b \mid x = a) = 1 - p$ if $b = a$ and $P(y = b \mid x = a) = \frac{p}{M-1}$ for any $b \ne a$. With this
choice of the joint distribution, since the bipartite graph of $\{P(x,y)\}$ is fully
connected, then so is the bipartite graph of $\{P(y,z)\}$, regardless of the choice
of $Z$. Therefore, the RI protocol used to describe the index of the subsets is
reduced to a Huffman code for $Z$.

It is shown in [7] that, for this distortion measure and source, only the
number of subsets in the partition, and not their content (i.e., the actual alphabet
letters in each subset), affects the average distortion. The distortion as a function
of the number of subsets, $K$, is given by $\frac{p}{M-1}(M-K)$. It turns out that in this
case, $\underline{R}_{ZD}(D) = L(X)\left(1 - \frac{D}{p}\right)$, which is obtained by time-sharing the two
trivial quantizers: the one that does not send information ($R = 0$, $D = p$) and
the lossless quantizer ($R = L(X)$, $D = 0$).
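
The claim that only the number of subsets matters can be checked by direct enumeration. The sketch below computes the exact error probability of the maximum likelihood decoder for a given partition under the symmetric model of this example (uniform source on $M$ letters, $P(y = x) = 1-p$, remaining mass spread evenly). The function name `ml_distortion` and the test values of $M$ and $p$ are illustrative choices, and the code assumes $1 - p \ge p/(M-1)$.

```python
from fractions import Fraction


def ml_distortion(partition, M, p):
    """Exact error probability of the ML decoder when the encoder sends the
    index of the subset containing x. Source: uniform on {0, ..., M-1};
    SI: P(y = x) = 1 - p and P(y = b | x) = p / (M - 1) for b != x."""
    off = p / (M - 1)
    error = Fraction(0)
    for subset in partition:
        for y in range(M):
            # P(y | x) for every candidate x in the announced subset
            weights = [(1 - p) if x == y else off for x in subset]
            # the decoder keeps the most likely candidate; the rest is error
            error += (sum(weights) - max(weights)) / M
    return error
```

For example, with $M = 6$ and $p = 1/10$, the partitions $\{0,1,2\},\{3,4,5\}$ and $\{0\},\{1,2,3,4,5\}$ both give $\frac{p}{M-1}(M-2) = 2/25$, while a single subset gives $p$ and six singletons give zero.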

3.4.3 Uniform source, given SI model 

We continue with the Hamming distortion measure and a uniform source with 
$|\mathcal{X}| = 5$. The channel from $X$ to $Y$ is given in Fig. 3 along with the characteristic
graph of $X$. In this example we set, for $a \in \mathcal{X}$, $P(y = a \mid x = a) = 1 - p$ and




Figure 3: P{y\x) and the resulting characteristic graph. 

a — ^p, b — ^p, c = j^p, d — |p, e — jp. Note that the chromatic number of 
G is 4. This means that any partitioning of the alphabet of X into less than 
4 subsets will incur a lossy reconstruction. Unlike the previous example where 
the SI could be used only in the reconstruction, but not to reduce the length 
of the transmission (since G was fully connected) , here it will be used for both. 
Note that, as in the example of Fig. 1, the optimal rate for lossless transmission is
actually obtained by using more subsets than the chromatic number of $X$. The
optimal two subset partition (|Z| = 2) is {1, 4} , {2, 3, 5}, yielding an average 
distortion of The rate for this (and any binary) partition is 1. The optimal 
3-subset partition is {1, 4} , {2, 5} , {3} yielding an average distortion of ^p. The 
average rate for this partitioning is 8/5. However, in this case it is beneficial 
to split {2,5} and obtain a lower rate of 7/5 (using the SI to alleviate the 
prefix requirement). Although we use more subsets, the rate is reduced, as 
discussed in the remarks that followed the statement of Theorem 1. In Fig. 4
we compare $\underline{R}_{ZD}(D)$ to the performance of a system that uses the minimal
number of subsets for each $D$ and uses Huffman codes, i.e., a system that uses
the SI only for the reproduction but not for the compression.
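
The rates quoted above for the two-subset and three-subset partitions agree with what a plain binary Huffman code on the subset index would give for a uniform five-letter source. The sketch below checks this under the assumption that the SI is not used for compression in these two cases; the rate of $7/5$ obtained after splitting $\{2,5\}$ relies on the RI protocol and on the graph of Fig. 3, and is not reproduced here. The function name is hypothetical.

```python
import heapq
from fractions import Fraction


def huffman_average_length(probs):
    """Average codeword length of a binary Huffman code for `probs`.
    Each merge adds one bit to every symbol below the merged node, so the
    average length is the sum of the merged probabilities."""
    heap = list(probs)
    heapq.heapify(heap)
    total = Fraction(0)
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, a + b)
    return total
```

The partition $\{1,4\},\{2,3,5\}$ has subset probabilities $(2/5, 3/5)$ and average length 1, while $\{1,4\},\{2,5\},\{3\}$ has probabilities $(2/5, 2/5, 1/5)$ and average length $8/5$.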




Figure 4: $\underline{R}_{ZD}(D)$ (solid) compared to a system that uses the SI only for
reproduction (dotted) with $p = 0.3$.

3.5 The Causal Setting 

In this section, we revisit the causal setting considered in [5] and 0. In this set- 
ting, we still restrict the encoding and decoding functions to be causal functions 
of their respective inputs; however, we do not restrict the delay which is
imposed by the system. Effectively, this means that we drop the instantaneous 
code constraint and allow more sophisticated coding techniques which introduce 
delays. In 0, [9], the SI was either not present or was not used in the recon- 
struction of Xt (unless it was available at both the encoder and decoder) and 
therefore, the encoder could calculate the reproduction symbols. This, in turn,
made it possible to recast the system into a reproduction coder (which creates $\{\hat{X}_t\}$)
followed by a lossless entropy coder, which coded $\hat{X}^n$ into a message that was
sent to the decoder (this is where the delay was introduced). However, when
the SI is available only at the decoder and used in the reconstruction, the
encoder cannot reproduce $\{\hat{X}_t\}$ and therefore the encoder and the reconstruction
decoder must be decoupled. In [3], the model we treat here was left unsolved. 
The combination of encoding that introduces delay with causal decoding func- 
tions is less motivated in practical terms than the zero-delay models that were 
introduced in the previous subsections. However, it can represent situations 
where the lossless coding of the encoder messages is done separately and then 
streamed to a simple causal decoder. The results which show that scalar codecs 
are optimal support such a model in retrospect. 

In contrast to the zero-delay model, where synchronization must be kept be- 
tween encoder and decoder over the bit-stream and therefore no decoding errors 
are allowed, here, since block coding is allowed, we have no such constraint. A 
block error with a vanishing probability will translate into a vanishing additional 
distortion as with the classical rate-distortion solutions. 

Let $\mathcal{R}_c(D)$ denote the infimum over all rates such that there exist causal
encoders $W_t = f_t(X^t)$ and decoders which are causal in the encoder's messages,
$\hat{X}_t = g_t(W^t, Y^n)$, satisfying the distortion constraint $D$, and define

$$R_c(D) = \min_{f,h} H(f(X) \mid Y), \tag{26}$$

where the minimum is over all functions $f : \mathcal{X} \to \mathcal{Z}$ and $h : \mathcal{Z} \times \mathcal{Y} \to \hat{\mathcal{X}}$ such
that $E\, d(X, h(f(X), Y)) \le D$. Finally, define $\underline{R}_c(D)$ to be the lower convex
envelope of $R_c(D)$. We have the following theorem:

Theorem 4.

$$\mathcal{R}_c(D) = \underline{R}_c(D). \tag{27}$$

We will only outline the proof since after a few steps it is similar to the
proofs of the previous theorems when replacing the average length functional
with the entropy functional. In the direct part, we again time-share at most
two scalar encoders and then code the resulting blocks using a Slepian-Wolf [17]
code. For the converse, we start with the converse to the lossless source coding
with SI, stating that $nR \ge H(W^n \mid Y^n)$. While this step is valid when the SI is
known at both sides when losslessly coding $W^n$, and might seem a loose lower
bound here, we know from the Slepian-Wolf theorem that there is essentially no
loss when the SI is available only at the decoder and a vanishing probability of
block error is allowed. We have:



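\begin{align}
nR &\ge H(W^n \mid Y^n) \notag \\
&= \sum_{t=1}^{n} H(W_t \mid W^{t-1}, Y^n) \notag \\
&\ge \sum_{t=1}^{n} H(W_t \mid W^{t-1}, X^{t-1}, Y^n) \notag \\
&= \sum_{t=1}^{n} H(W_t \mid X^{t-1}, Y^n) \notag \\
&= \sum_{t=1}^{n} \int H(W_t \mid x^{t-1}, y^{t-1}, Y_t, y_{t+1}^n)\, d\mu(x^{t-1}, y^{t-1}, y_{t+1}^n) \notag \\
&= \sum_{t=1}^{n} \int H\big(f_t(X_t, x^{t-1}) \mid Y_t\big)\, d\mu(x^{t-1}, y^{t-1}, y_{t+1}^n) \tag{28}
\end{align}

From here, one can identify that $R_c(D)$ can further lower bound the conditional
entropy in the last equation. The rest of the proof essentially follows the
steps of the proof of Theorem 2 after equation (13).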
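
The functional that replaces the average length in the causal setting, $H(f(X) \mid Y)$, is easy to evaluate for a given joint distribution and scalar quantizer. The sketch below is a plain numerical illustration; the joint distribution in the usage note is a made-up toy example and the function name is hypothetical.

```python
import math
from collections import defaultdict


def conditional_entropy_of_quantizer(joint, f):
    """H(f(X) | Y) in bits, where `joint` maps (x, y) to P(x, y)."""
    p_zy = defaultdict(float)
    p_y = defaultdict(float)
    for (x, y), p in joint.items():
        p_zy[(f(x), y)] += p  # joint law of the quantizer output and the SI
        p_y[y] += p
    return -sum(p * math.log2(p / p_y[y])
                for (z, y), p in p_zy.items() if p > 0)
```

As a toy case, let $X$ be uniform on $\{0,1,2,3\}$ and $Y = X \bmod 2$. The quantizer $f(x) = \lfloor x/2 \rfloor$ gives $H(f(X) \mid Y) = 1$ bit, whereas $f(x) = x \bmod 2$ gives 0 bits because the decoder already knows it from the SI.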



4 Multiterminal Problem 

In this section we revisit the multiterminal source coding problem (two-user 
Wyner-Ziv problem), where two correlated random variables are observed sepa- 
rately by two non-cooperative encoders who communicate with a decoder. The 
decoder needs to reconstruct both sources and the distortion between the re- 
constructions and the original variables should not exceed a given threshold. 
As mentioned in the Introduction, this problem has been open for the general 
setting for over three decades. There are specific cases in which the arbitrary 
delay problem can be solved for a general source and distortion measure. When 
no distortion is allowed, this is the Slepian-Wolf problem [17], and when one of
the variables is known to the decoder, this is the original Wyner-Ziv problem.
Other examples include the source coding with side information of Ahlswede-
Korner and Wyner [18], [19], where arbitrary distortion is allowed for one of the
sources and the other source should be reconstructed losslessly. Finally, Berger
and Yeung [20] considered a setting where one of the sources is to be perfectly
reconstructed and the other source should be reconstructed with a distortion 
constraint (their setting subsumes all previous examples). Recent results for 
specific sources or distortion measures include the achievable rate distortion re- 
gion for the quadratic Gaussian multiterminal source coding problem, given by 
Wagner, Tavildar, and Viswanath [21] and the characterization of the region
under logarithmic loss given by Courtade and Weissman [22].

We consider a zero-delay version of the multiterminal source coding problem. 
We show that, unlike the arbitrary delay setting, when the zero delay constraint 
is imposed, this problem can be readily solved and the rate distortion region 
can be characterized. 



4.1 Formal Definition and Main Result 

We start by formally stating the problem. Let $(X_t, Y_t)$ be emitted by a
memoryless source. The symbols are encoded by $f_{x,t}$, $f_{y,t}$, respectively, to
produce $W_t = f_{x,t}(X^t)$, $Z_t = f_{y,t}(Y^t)$. A decoder, $g_t : \mathcal{W}^t \times \mathcal{Z}^t \to \hat{\mathcal{X}} \times \hat{\mathcal{Y}}$, uses
$(W^t, Z^t)$ to reproduce the pair $(\hat{X}_t, \hat{Y}_t)$. Let $(g_{x,t}, g_{y,t})$ denote the reproduction
functions of $X$, $Y$, respectively. Given two distortion measures, $d_x : \mathcal{X} \times \hat{\mathcal{X}} \to
\mathbb{R}^+$, $d_y : \mathcal{Y} \times \hat{\mathcal{Y}} \to \mathbb{R}^+$, it is required that $\frac{1}{n} \sum_{t=1}^{n} E\, d_x(X_t, \hat{X}_t) \le D_x$ and
$\frac{1}{n} \sum_{t=1}^{n} E\, d_y(Y_t, \hat{Y}_t) \le D_y$ simultaneously. Each encoder encodes its message
with an instantaneous code. As noted in the Introduction, there is more than
one way to define the zero-delay decoding in a multiterminal setting. We allow
either $Z_t$ or $W_t$ to be decoded first by the decoder, and therefore the message
that is decoded first (and generally all past decoded symbols) can serve as SI for
the other. Let $l^x_t, l^y_t$ denote the number of bits transmitted until the time that
$X_t$ and $Y_t$ are encoded (inclusive), respectively. The system model is depicted in Fig. 5.



Figure 5: Two users system model

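Let $\mathcal{S}$ denote the subset of $\mathbb{R}^4$ consisting of all quadruples $(R_x, R_y, D_x, D_y)$
such that for any given $\epsilon > 0$ there exist sequences of functions $\{f_{x,t}\}$, $\{f_{y,t}\}$, $\{g_t\}$
and an integer $n$ such that

i. $\frac{1}{n} E\, l^x_n \le R_x + \epsilon$

ii. $\frac{1}{n} E\, l^y_n \le R_y + \epsilon$

iii. $\frac{1}{n} \sum_{t=1}^{n} E\, d_x(X_t, \hat{X}_t) \le D_x + \epsilon$

iv. $\frac{1}{n} \sum_{t=1}^{n} E\, d_y(Y_t, \hat{Y}_t) \le D_y + \epsilon$

Our goal is to characterize the above operationally defined region by single-
letter information-theoretic functionals. Towards this goal, let us define $\mathcal{A}_{yx}$ to
be the set containing quadruples $(R_x, R_y, D_x, D_y)$ such that there exist random
variables $(U, V, Q)$ that belong to finite alphabets $\mathcal{U}, \mathcal{V}, \mathcal{Q}$, with $|\mathcal{Q}| \le 5$,
$|\mathcal{U}| \le |\mathcal{X}| \times |\mathcal{Q}|$, $|\mathcal{V}| \le |\mathcal{Y}| \times |\mathcal{Q}|$, which are jointly distributed with $(X, Y)$ and satisfy

1. $U = f_x(X, Q)$, $V = f_y(Y, Q)$, for some functions $f_x : \mathcal{X} \times \mathcal{Q} \to \mathcal{U}$, $f_y :
\mathcal{Y} \times \mathcal{Q} \to \mathcal{V}$, and $Q$ is independent of $(X, Y)$.

2. There exist $g_x, g_y$ such that $g_x : \mathcal{U} \times \mathcal{V} \times \mathcal{Q} \to \hat{\mathcal{X}}$, $g_y : \mathcal{U} \times \mathcal{V} \times \mathcal{Q} \to \hat{\mathcal{Y}}$, satisfying
$E\, d_x(X, g_x(U, V, Q)) \le D_x$ and $E\, d_y(Y, g_y(U, V, Q)) \le D_y$.

3. $R_x \ge L_V(U \mid Q)$

4. $R_y \ge L(V \mid Q)$

The subscript in $\mathcal{A}_{yx}$ stands for the order of transmission, where the message
of the $Y$-encoder, $V$, is decoded first and serves as SI to the message from the
$X$-encoder. We define $\mathcal{A}_{xy}$ in the same way but with reversed transmission
order, i.e., items 3 and 4 are replaced by $R_x \ge L(U \mid Q)$ and $R_y \ge L_U(V \mid Q)$.
Finally, let $\mathcal{A}^* = \mathcal{A}_{xy} \cup \mathcal{A}_{yx}$ and define $\overline{\mathcal{A}^*}$ to be the closure of $\mathcal{A}^*$.
The main result of this section is the following theorem:

Theorem 5. $\mathcal{S} = \overline{\mathcal{A}^*}$.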

Theorem 5 basically states that using scalar encoders and decoders is
optimal. The role of the encoder that transmits first is to send a message that
will not only allow the joint reconstruction, but also serve as SI for the
decoding of the other message, thus reducing the average length which is needed
for the other message. In the proof of the converse, we will show that there
is no performance gain if each of the encoders receives the observations of the
other encoder with delay (without delay, this becomes a single-user problem
for the source $(X, Y)$ with two distortion measures). There are several obvious
extensions to this problem. For example, we can consider breaking each user's
message into small pieces and consider all protocols that order the transmission
of the pieces between the users, where all previously transmitted pieces serve as
SI. Then, we can optimize over all possible such protocols (since the alphabets
of $X$ and $Y$ are finite, there is a finite number of such protocols). Additionally,
SI can be available to the decoder. Also, the encoder that transmits first might
share some information with the second encoder at a given rate. The extension
to such scenarios follows the same lines of proof as those of Theorems 1-5 and
is therefore omitted. Another variant of this problem is a two-way problem
where the $X$-encoder and $Y$-encoder communicate, where each user reproduces
the other source to within a given distortion. It can be shown that in such a
problem the past is, again, irrelevant and scalar encoders are optimal. In such
a scenario, the interaction can be used both to assist the reconstruction and to
reduce the average length which is needed to convey each message, as shown by
Orlitsky in [25].

We prove the converse and direct parts in the next two subsections, respec- 
tively. 



4.2 Converse: 

We prove the converse assuming $Z_t$ is sent first. The proof when $W_t$ is sent first
is the same. Once again, we use a "genie aided" scheme, where each encoder has
access to $(X^{t-1}, Y^{t-1})$ when encoding $X_t$ or $Y_t$. We will show that even when
each encoder knows the past symbols of the other source, scalar encoders
are still optimal. With this scheme the only element that is unknown to each of the
encoders is the current message of the other encoder, and the decoder's data is
shared by both encoders.

For any sequence of encoders $\{f_{x,t}(X^t, Y^{t-1})\}$, $\{f_{y,t}(X^{t-1}, Y^t)\}$ and
decoders $\{g_{x,t}(W^t, Z^t)\}$, $\{g_{y,t}(W^t, Z^t)\}$ achieving $(R_x, R_y, D_x, D_y) \in \mathcal{S}$, the
following holds:

\begin{align}
nR_y &\ge \sum_{t=1}^{n} L(Z_t \mid X^{t-1}, Y^{t-1}) \notag \\
&= \sum_{t=1}^{n} L\big(f_{y,t}(X^{t-1}, Y^{t-1}, Y_t) \mid X^{t-1}, Y^{t-1}\big) \notag \\
&= \sum_{t=1}^{n} L\big(f_{y,t}(Q_t, Y_t) \mid Q_t\big), \tag{29}
\end{align}

where we set $Q_t = (X^{t-1}, Y^{t-1})$, which is independent of $(X_t, Y_t)$. We continue
by letting $M$ be a random variable distributed uniformly over $\{1, 2, \ldots, n\}$,
independent of all other variables. Continuing (29) we have

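\begin{align}
R_y &\ge \frac{1}{n} \sum_{t=1}^{n} L\big(f_{y,t}(Q_t, Y_t) \mid Q_t\big) \notag \\
&= \frac{1}{n} \sum_{t=1}^{n} L\big(f_{y,t}(Q_t, Y_t) \mid Q_t, M = t\big) \notag \\
&= L\big(f_y(M, Q_M, Y_M) \mid Q_M, M\big) \notag \\
&= L\big(f_y(Q, Y) \mid Q\big) \notag \\
&= L(V \mid Q), \tag{30}
\end{align}

where $Q = (Q_M, M)$ and $V = f_y(Q, Y)$, and we used the fact that $Y$'s
distribution does not depend on the time.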

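Using the same definitions of $M, Q_t, Q, V$ and following the same steps, we
have for the rate of the $X$-encoder:

\begin{align}
R_x &\ge \frac{1}{n} \sum_{t=1}^{n} L_{Z_t}(W_t \mid X^{t-1}, Y^{t-1}) \notag \\
&= \frac{1}{n} \sum_{t=1}^{n} L_{f_{y,t}(X^{t-1}, Y^{t-1}, Y_t)}\big(f_{x,t}(X^{t-1}, Y^{t-1}, X_t) \mid X^{t-1}, Y^{t-1}\big) \notag \\
&= \frac{1}{n} \sum_{t=1}^{n} L_{f_{y,t}(Q_t, Y_t)}\big(f_{x,t}(Q_t, X_t) \mid Q_t\big) \notag \\
&= \frac{1}{n} \sum_{t=1}^{n} L_{f_{y,t}(Q_t, Y_t)}\big(f_{x,t}(Q_t, X_t) \mid Q_t, M = t\big) \notag \\
&= L_{f_y(Q, Y)}\big(f_x(Q, X) \mid Q\big) \notag \\
&= L_V(U \mid Q), \tag{31}
\end{align}

where $U = f_x(Q, X)$.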



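For the distortion we have that:

\begin{align}
D_x &\ge \frac{1}{n} \sum_{t=1}^{n} E\, d_x(X_t, \hat{X}_t) \notag \\
&= \frac{1}{n} \sum_{t=1}^{n} E\, d_x\big(X_t, g_{x,t}(W^t, Z^t)\big) \notag \\
&\ge \frac{1}{n} \sum_{t=1}^{n} E\big[ d_x\big(X_t, \tilde{g}_{x,t}(X^{t-1}, Y^{t-1}, f_{x,t}(X^{t-1}, Y^{t-1}, X_t), f_{y,t}(X^{t-1}, Y^{t-1}, Y_t))\big) \big] \tag{32} \\
&= \frac{1}{n} \sum_{t=1}^{n} E\big[ d_x\big(X_t, \tilde{g}_{x,t}(Q_t, f_{x,t}(Q_t, X_t), f_{y,t}(Q_t, Y_t))\big) \big] \notag \\
&= \frac{1}{n} \sum_{t=1}^{n} E\big[ d_x\big(X_t, \tilde{g}_{x,t}(Q_t, f_{x,t}(Q_t, X_t), f_{y,t}(Q_t, Y_t))\big) \,\big|\, M = t \big] \notag \\
&= E\big[ d_x\big(X_M, \tilde{g}_x(M, Q_M, f_x(M, Q_M, X_M), f_y(M, Q_M, Y_M))\big) \big] \notag \\
&= E\big[ d_x\big(X, \tilde{g}_x(Q, U, V)\big) \big], \tag{33}
\end{align}

where in (32) we used the fact that giving the decoder access to $(X^{t-1}, Y^{t-1})$ can
only decrease the average distortion (with $\tilde{g}_{x,t}$ denoting a decoder that has this
access). In exactly the same manner, by exchanging the roles of $X$ and $Y$, we get

$$D_y \ge E\big[ d_y\big(Y, \tilde{g}_y(Q, U, V)\big) \big]. \tag{34}$$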
The bound on the cardinality of $\mathcal{Q}$ is a consequence of Caratheodory's
theorem [24]: As can be seen in (35)-(38), the point

$$\big(L_V(U \mid Q),\; L(V \mid Q),\; E\, d_x(X, g_x(U, V, Q)),\; E\, d_y(Y, g_y(U, V, Q))\big)$$

lies in the convex hull of the set $\{(L_V(U \mid Q = q), L(V \mid Q = q), D_x(q), D_y(q))\}$, which
is a subset of $\mathbb{R}^4$. Therefore, by Caratheodory's theorem, this region will not
change if we restrict the alphabet size of $Q$ to 5.

4.3 Direct: 

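In order to show that $\mathcal{A}^* \subseteq \mathcal{S}$ we time-share scalar quantizers. For every
quadruple $(R_x, R_y, D_x, D_y) \in \mathcal{A}^*$ we have

\begin{align}
L_V(U \mid Q) &= \sum_{q=1}^{|\mathcal{Q}|} P(Q = q)\, L_V(U \mid Q = q) \notag \\
&\triangleq \sum_{q=1}^{|\mathcal{Q}|} P(Q = q)\, R_x(q), \tag{35} \\
L(V \mid Q) &= \sum_{q=1}^{|\mathcal{Q}|} P(Q = q)\, L(V \mid Q = q) \notag \\
&\triangleq \sum_{q=1}^{|\mathcal{Q}|} P(Q = q)\, R_y(q), \tag{36} \\
E\, d_x(X, g_x(U, V, Q)) &= \sum_{q=1}^{|\mathcal{Q}|} P(Q = q)\, E\big[ d_x(X, g_x(U, V, Q)) \mid Q = q \big] \notag \\
&\triangleq \sum_{q=1}^{|\mathcal{Q}|} P(Q = q)\, D_x(q), \tag{37} \\
E\, d_y(Y, g_y(U, V, Q)) &= \sum_{q=1}^{|\mathcal{Q}|} P(Q = q)\, E\big[ d_y(Y, g_y(U, V, Q)) \mid Q = q \big] \notag \\
&\triangleq \sum_{q=1}^{|\mathcal{Q}|} P(Q = q)\, D_y(q). \tag{38}
\end{align}

Now, for every $q \in \mathcal{Q}$, we will use the scalar encoders $f_x(X, q)$, $f_y(Y, q)$ which
define $U$ and $V$ (using rates $R_x(q)$, $R_y(q)$, respectively) and the scalar decoders
$g_x(U, V, q)$ and $g_y(U, V, q)$ to reconstruct $X, Y$ with average distortions $D_x(q)$
and $D_y(q)$, respectively. Now for $n$ large enough, we can find $n_1, n_2, \ldots, n_{|\mathcal{Q}|}$
such that $\sum_{q} n_q = n$ and $n_q/n$ is arbitrarily close to $P(Q = q)$. For each
$q \in \mathcal{Q}$, we use the encoder and decoders pertaining to $Q = q$ for $n_q$ coding
stages. With this time-sharing we can approach $(R_x + \epsilon, R_y + \epsilon, D_x + \epsilon, D_y + \epsilon)$
for any $\epsilon > 0$ if $n$ is large enough.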
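
One simple way to choose the integers $n_q$ is largest-remainder rounding, sketched below. This is only an illustration of the existence claim above, not a procedure taken from the paper; the function name `timeshare_slots` is hypothetical, and the probabilities are assumed to be given as exact fractions summing to one.

```python
import math
from fractions import Fraction


def timeshare_slots(probs, n):
    """Split n coding stages into integer counts n_q with sum n and
    |n_q / n - P(Q = q)| < 1 / n (largest-remainder rounding)."""
    ideal = [p * n for p in probs]
    counts = [math.floor(v) for v in ideal]
    leftover = n - sum(counts)
    # hand the remaining stages to the largest fractional parts
    order = sorted(range(len(probs)),
                   key=lambda q: ideal[q] - counts[q], reverse=True)
    for q in order[:leftover]:
        counts[q] += 1
    return counts
```

For instance, with $P(Q = q) = (1/2, 1/3, 1/6)$ and $n = 10$ the counts are $(5, 3, 2)$; as $n$ grows, every ratio $n_q/n$ is within $1/n$ of its target.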

5 Conclusions 

In this work we derived the fundamental limits of zero-delay lossy source coding 
with SI. It was shown that for the single-user setting, results in the spirit of
[5] continue to hold in the sense that time-sharing scalar quantizers, followed
by SI-aware encoders, is optimal. These codes have the feature that increasing
the encoder's alphabet size can reduce the average length, in contrast to the more
intuitive opposite behavior, which holds when SI is not available and Huffman codes are
used. Furthermore, we showed that in the multiterminal setting, the zero-delay
variant of the multiterminal source coding problem can be readily solved, unlike
the arbitrary-delay problem, which has been open for over three decades. We also discussed
possible extensions which can be proved following the same methods we used 
in this paper. Although we discussed only finite alphabets, we believe that 
extending the results of this paper to continuous alphabets in the multiterminal 
setting or continuous alphabets with discrete SI (for example if the decoder has 
another quantized version of the source as SI) in the single-user setting should 
be straightforward. 

It was shown that without SI, there are cases where there is no loss in forc- 
ing causal/sequential decoding, compared to standard rate-distortion decoding 
which waits for the whole block to arrive before producing the reconstruction. 
In all other cases, we showed that the redundancy, compared to optimal
rate-distortion performance is less than one bit when variable-rate coding is 
allowed. We used a simple scheme that relied on the classical random rate- 
distortion codebook construction, and only changed the way the encoder and 
decoder operate. However, trying to do the same when SI is available results 
in a substantial performance degradation. It is an interesting question whether 
there is an unavoidable loss when restricting to sequential decoding in these 
cases or whether our scheme, which works without SI, is too naive when SI is 
available.

Appendix

A Details of the Sequential Scheme

In this appendix, we show the analysis of the sequential scheme which was briefly
described in Section 3.1. This scheme uses a classical rate-distortion random
code [13]. The encoder finds the first codeword, $\hat{X}^n$, in the codebook which
is distortion-typical (as defined in [14]) and starts to transmit the reproduction
symbols, $\hat{X}_t$, sequentially, using $\log|\hat{\mathcal{X}}|$ bits each transmission (in contrast to
the classical encoder which will send the index of the codeword in a block). The
decoder outputs $\hat{X}_t$ as it receives it. The idea here is that after receiving $n\alpha$
symbols, the decoder can detect the specific codeword in the codebook since,
with high probability, no other codeword will have the same prefix and the
rest of the reproduction symbols ($\hat{X}_{n\alpha+1}^{n}$) can be reproduced without further
transmissions from the encoder.

We start by analyzing the probability that, given a sequence, if we now
randomly draw a codebook, we will draw another sequence with the same prefix.
Formally, let $X$ be drawn from an i.i.d. source $P(x)$. We first draw a single
sequence of length $n$ according to $P(\mathbf{x}) = \prod_{t=1}^{n} P(x_t)$, then we independently
draw $2^{nR}$ sequences using the same source. We ask what is the probability that
none of the $2^{nR}$ sequences start with the same $n\alpha$-symbol prefix as the first
sequence? We average this over the probability to draw a specific sequence in
the first drawing. If we find an $\alpha$ for which the probability of such a prefix
collision vanishes as $n$ grows, this will give us a bound on the number of symbols
we need to send in our sequential scheme in order to identify the correct codeword.
Let the probability that no other sequence shares the prefix be denoted by $P_c$.
We will treat only the first $n\alpha$ symbols since we
do not care about the contents of the rest. We use the method of types. $P_{\mathbf{x}}$ will
denote the empirical distribution of $\mathbf{x}$, i.e., $P_{\mathbf{x}}(x) = N_{\mathbf{x}}(x)/(n\alpha)$ where $N_{\mathbf{x}}(x)$
counts the number of occurrences of the symbol $x$ in $\mathbf{x}$. $H(P_{\mathbf{x}})$ will denote the
empirical entropy of the empirical distribution $P_{\mathbf{x}}(\cdot)$. We use standard results
of the method of types which can be found in [24].

After we have the first sequence, the probability that it will not be the
outcome of the next drawing is $1 - 2^{-n\alpha(D(P_{\mathbf{x}} \| P) + H(P_{\mathbf{x}}))}$. Therefore

X 

Px 

> _ 2-naH(P)^ (A.l) 

where the second equahty is true since the the expression in the first hne depends 
only on the type-class of x and the last inequality is true since from the sum over 
all type-classes, we took only Px = P (we assume that n is large enough so we 
can get as close as we want to P, even if it is not rational). The last expression 
converges to unity double exponentially fast if aH{P) > R or cquivalently, 
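
As a numerical illustration of the threshold $\alpha H(P) > R$, the following sketch evaluates the bound $(1 - 2^{-n\alpha H(P)})^{2^{nR}}$ for growing $n$. The function name and the parameter values are assumptions chosen for illustration; `math.log1p` is used only to keep precision when the per-draw match probability is tiny.

```python
import math

def prob_no_prefix_match(n, alpha, entropy_bits, rate):
    """Evaluate (1 - 2^{-n*alpha*H})^(2^{n*R}) in a numerically stable way.

    Assumes n*alpha*entropy_bits > 0 and moderate n*rate (no float overflow).
    """
    p_match = 2.0 ** (-n * alpha * entropy_bits)   # one draw hits the prefix
    num_draws = 2.0 ** (n * rate)                  # codebook size 2^{nR}
    return math.exp(num_draws * math.log1p(-p_match))

# Hypothetical setting: uniform binary source (H = 1 bit), R = 0.5.
for n in (50, 100, 200):
    above = prob_no_prefix_match(n, alpha=0.6, entropy_bits=1.0, rate=0.5)
    below = prob_no_prefix_match(n, alpha=0.4, entropy_bits=1.0, rate=0.5)
    print(n, above, below)
```

With $\alpha H(P) = 0.6 > R$ the value approaches one as $n$ grows, while with $\alpha H(P) = 0.4 < R$ it collapses toward zero, which matches the condition stated above.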

We now use this result to analyze a coding system with sequential decoders. 
We draw a codebook consisting of $2^{n(R(D)+\epsilon)}$ codewords according to the prior 
on $\hat{X}$ that is calculated from the channel that achieves the (standard) 
rate-distortion function. Given a source sequence, $X^n$, to be encoded, the 
encoder looks for a sequence in the codebook, $\hat{X}^n$, that is distortion typical (see 
[14]) with the source sequence and starts sending the reproduction symbols 
$\hat{X}_t$, $t = 1, 2, \ldots, n\alpha$, sequentially, $\log|\hat{\mathcal{X}}|$ bits at a time. The decoder outputs 
the reproduction symbols sequentially as it receives their description. By the 
above result, if $\alpha = \frac{R(D)}{H(\hat{X})} + 2\epsilon$, then, with probability that converges to unity 
double-exponentially fast, there is no other codeword in the codebook 
with the same prefix. Therefore, after receiving $n\left(\frac{R(D)}{H(\hat{X})} + 2\epsilon\right)\log|\hat{\mathcal{X}}|$ bits, with very high 
probability, the decoder knows the whole sequence $\hat{X}^n$ and can output the rest. 
For the whole $n$-block, we send no more than $\frac{1}{n}\, n\left(\frac{R(D)}{H(\hat{X})} + 2\epsilon\right)\log|\hat{\mathcal{X}}|$ bits per 
source symbol. This means that whenever the rate-distortion achieving prior 
is uniform, we achieve the optimal rate-distortion performance with this simple 
sequential scheme (since $H(\hat{X}) = \log|\hat{\mathcal{X}}|$). Otherwise, we are away from 
optimality by a factor of $\frac{\log|\hat{\mathcal{X}}|}{H(\hat{X})}$. This factor can be reduced if we allow variable 
rate coding, i.e., relax the constant number of bits per transmission constraint 
to a constraint on the average number of bits per transmission. If variable rate 
coding of $\hat{X}_t$ is allowed, we can upper bound the average number of bits we send 
each transmission by $H(\hat{X}) + 1$. In this case the number of bits per source 
symbol we send is upper bounded by $R(D) + \frac{R(D)}{H(\hat{X})} < R(D) + 1$, meaning we 
are less than one bit away from optimal performance with this simple 
sequential scheme for any source and distortion measure. Note that, essentially, we 
used the same random code used in the classical analysis of the achievability of 
the rate-distortion results. The error event is the union of the event that the 
random code does not cover a specific source sequence with distortion $D$ and the 
event that the codeword that describes this source sequence with distortion $D$ 
has another codeword with the same prefix. The complete analysis that shows 
that there exists a codebook for which our scheme works is the same as the 
one found in [14, Chapter 10.5] and is therefore omitted. 
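
To make the two rate expressions concrete, here is a small worked computation for one specific case that is an assumption of this sketch, not a setting taken from the appendix: a Bernoulli($p$) source with Hamming distortion, for which the standard results $R(D) = h(p) - h(D)$ and $P(\hat{X} = 1) = (p - D)/(1 - 2D)$ hold for $0 < D < \min(p, 1-p)$. The $\epsilon$ terms are dropped, and the function names are hypothetical.

```python
import math

def binary_entropy(q):
    """Binary entropy in bits, with h(0) = h(1) = 0."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def sequential_rates(p, D):
    """Return (R(D), fixed-rate bits/symbol, variable-rate bound).

    Bernoulli(p) source, Hamming distortion, binary reproduction alphabet;
    epsilon terms from the analysis are ignored.
    """
    if not 0 < D < min(p, 1 - p):
        raise ValueError("need 0 < D < min(p, 1 - p)")
    rate = binary_entropy(p) - binary_entropy(D)   # R(D) = h(p) - h(D)
    r = (p - D) / (1 - 2 * D)                      # P(X_hat = 1)
    h_rep = binary_entropy(r)                      # H(X_hat)
    fixed = rate / h_rep * math.log2(2)            # (R(D)/H) * log|alphabet|
    variable_bound = rate / h_rep * (h_rep + 1.0)  # = R(D) + R(D)/H
    return rate, fixed, variable_bound

for p, D in ((0.2, 0.05), (0.5, 0.1)):
    rate, fixed, variable_bound = sequential_rates(p, D)
    print(p, D, rate, fixed, variable_bound)
```

For $p = 0.5$ the reproduction prior is uniform, so the fixed-rate figure coincides with $R(D)$. For a binary alphabet the fixed-rate figure is in fact below the variable-rate upper bound, since $H(\hat{X}) + 1 > \log|\hat{\mathcal{X}}| = 1$; the variable-rate bound becomes the useful one for larger alphabets with skewed priors.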

B Proof of Theorem 3 

In this appendix, we prove the first part of Theorem 3, namely, we show that 
$\mathcal{R}_{s-e}(D) = R_{ZD}(D)$. Once again, the direct part is the same as the direct part 
of Theorem 2, so we need to prove only the converse. 

When the encoder is scalar (memoryless), it cannot use the past transmitted 
messages as SI. Therefore, for each message, at least $L_{Y_t}(W_t)$ bits are sent. We have 

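$$
\begin{aligned}
nR &\geq \sum_{t=1}^{n} L_{Y_t}\big(f_t(X_t)\big) \\
   &\geq \sum_{t=1}^{n} L_{Y_t}\big(f_t(X_t) \mid W^{t-1}, W_{t+1}^{n}, Y^{t-1}, Y_{t+1}^{n}\big) \\
   &\geq \sum_{t=1}^{n} \int L_{Y_t}\big(f_t(X_t) \mid w^{t-1}, w_{t+1}^{n}, y^{t-1}, y_{t+1}^{n}\big)\, d\mu(w^{t-1}, w_{t+1}^{n}, y^{t-1}, y_{t+1}^{n}) \\
   &\geq \sum_{t=1}^{n} \int L_{Y_t}\big(f_t(X_t)\big)\, d\mu(w^{t-1}, w_{t+1}^{n}, y^{t-1}, y_{t+1}^{n}) 
\end{aligned}
\tag{B.1}
$$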

Now, the rate in the inner expression cannot be smaller than the rate of the 
optimal scalar system (with all conditioned elements serving as indices of the functions) 
that achieves the same distortion as the given system (with decoder functions 
$h_t(w^{t-1}, W_t, y^{t-1}, Y_t, y_{t+1}^{n})$): 

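$$
\begin{aligned}
nR &\geq \sum_{t=1}^{n} \int L_{Y_t}\big(f_t(X_t)\big)\, d\mu(w^{t-1}, w_{t+1}^{n}, y^{t-1}, y_{t+1}^{n}) \\
   &\geq \sum_{t=1}^{n} \int R_{ZD}\Big(E\big[d\big(X_t, g_t(f_t(X_t), w^{t-1}, w_{t+1}^{n}, y^{t-1}, Y_t, y_{t+1}^{n})\big)\big]\Big) \times 
        d\mu(w^{t-1}, w_{t+1}^{n}, y^{t-1}, y_{t+1}^{n}) 
\end{aligned}
\tag{B.2}
$$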

From here, using the convexity and monotonicity of $R_{ZD}(\cdot)$ and following the 
steps of the proof of Theorem 2, the proof is concluded. 
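
The concluding step is not spelled out in this appendix. The following is only a sketch of how it presumably goes, assuming that $R_{ZD}(\cdot)$ is convex and non-increasing (as expected of a time-sharing envelope) and that the average distortion of the given system is at most $D$. The shorthands $D_t(w, y)$ for the conditional expected distortion inside (B.2), $\mu(w, y)$ for the measure over the conditioned elements, and $\bar{D}_t = \int D_t(w, y)\, d\mu(w, y)$ are introduced here for illustration only.

```latex
\begin{align*}
nR &\geq \sum_{t=1}^{n} \int R_{ZD}\bigl(D_t(w,y)\bigr)\, d\mu(w,y) \\
   &\geq \sum_{t=1}^{n} R_{ZD}\bigl(\bar{D}_t\bigr)
      && \text{(Jensen's inequality, convexity)} \\
   &\geq n\, R_{ZD}\Bigl(\frac{1}{n}\sum_{t=1}^{n} \bar{D}_t\Bigr)
      && \text{(Jensen's inequality again)} \\
   &\geq n\, R_{ZD}(D)
      && \text{(monotonicity, average distortion at most } D\text{)}
\end{align*}
```

Dividing by $n$ would then give $R \geq R_{ZD}(D)$, which is the converse statement.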